🚀 Pipeline Results Report

Generated: /root/verbalize-sampling

Experiments: 4

Metrics: diversity, ngram, joke_quality

📋 Experiment Configurations

Name | Task | Method | Model | Responses | Temperature
direct (samples=1) | joke | direct | meta-llama/llama-3.1-70b-instruct | 30 | 0.7
sequence [strict] (samples=5) | joke | sequence | meta-llama/llama-3.1-70b-instruct | 30 | 0.7
multi_turn [strict] (samples=5) | joke | multi_turn | meta-llama/llama-3.1-70b-instruct | 30 | 0.7
structure_with_prob [strict] (samples=5) | joke | structure_with_prob | meta-llama/llama-3.1-70b-instruct | 30 | 0.7

📝 Generation Examples

direct (samples=1)

Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it's here or not." The man laughed and said, "Well, that's a real paws-ibility!" The librarian just rolled her eyes and muttered, "Cat-astrophic puns are the worst." The man grinned and said, "Hey, I'm just trying to make a purr-fect joke!" The librarian sighed and handed him a book, saying, "Here, take this and paws for a moment. Maybe it'll help you come up with a joke that's the cat's meow."
Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it's here or not."
Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it's here or not."

sequence [strict] (samples=5)

Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
A man walked into a library and asked the librarian, 'Do you have any books on Pavlov's dogs and Schrödinger's cat?' The librarian replied, 'It rings a bell, but I'm not sure if it's here or not.'
Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
Why couldn't the bicycle stand up by itself? Because it was two-tired!
Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
A man walked into a library and asked the librarian, 'Do you have any books on Pavlov's dogs and Schrödinger's cat?' The librarian replied, 'It rings a bell, but I'm not sure if it's here or not.'

multi_turn [strict] (samples=5)

Method: multi_turn
Task: joke
Prompt:
Make a joke about coffee
Response:
Why did the coffee file a police report? Because it got mugged!
Method: multi_turn
Task: joke
Prompt:
Make a joke about coffee
Response:
Why did the coffee go to therapy? Because it was feeling a little bitter!
Method: multi_turn
Task: joke
Prompt:
Make a joke about coffee
Response:
Why did the coffee go to the doctor? Because it was feeling a little jittery!

structure_with_prob [strict] (samples=5)

Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
{'text': 'Why don't scientists trust atoms? Because they make up everything.', 'probability': 0.4}
Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
:{
Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
{'text': 'A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it's here or not."', 'probability': 0.3}

📊 Results Summary

Diversity Results

Note: Detailed arrays excluded from table: pairwise_diversities
Experiment | Average Diversity | Min Diversity | Max Diversity | Std Diversity | Average Response Length | Average Unique Words | Average Vocabulary Richness | Total Cost
direct (samples=1) | 0.0711 | 0.0000 | 0.4483 | 0.1011 | 89.0400 | 51.1400 | 0.8393 | 0.0003
sequence [strict] (samples=5) | 0.2733 | 0.0000 | 0.4850 | 0.1310 | 18.2115 | 16.5513 | 0.9580 | 0.0001
multi_turn [strict] (samples=5) | 0.2234 | 0.0000 | 0.4899 | 0.0996 | 70.6933 | 52.1400 | 0.7908 | 0.0003
structure_with_prob [strict] (samples=5) | 0.2271 | 0.0000 | 0.4409 | 0.1310 | 15.0351 | 13.7105 | 0.9724 | 0.0001
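
The Average, Min, Max, and Std columns above summarize the per-pair values in the excluded pairwise_diversities arrays. The report does not say how the pipeline aggregates them, so the following is only a minimal sketch of one plausible aggregation. The function name summarize_pairwise is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics


def summarize_pairwise(scores):
    """Collapse a list of pairwise scores into four summary statistics.

    Assumption: population standard deviation; the pipeline may use the
    sample standard deviation instead.
    """
    if not scores:
        raise ValueError("need at least one pairwise score")
    return {
        "average": statistics.fmean(scores),
        "min": min(scores),
        "max": max(scores),
        "std": statistics.pstdev(scores),
    }


# Illustrative values only, not taken from the report.
summary = summarize_pairwise([0.0, 0.25, 0.5])
```

A minimum of 0.0000 in every row is what this kind of summary would show whenever at least one pair of responses is scored as having no diversity at all.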

📈 Diversity Visualizations

Figure: Aggregate Metrics
Figure: Pairwise Diversities Distribution
Figure: Response Length Distribution
Figure: Unique Words Distribution
Figure: Vocabulary Richness Distribution

Ngram Results

Note: Detailed arrays excluded from table: pairwise_rouge_l_scores
Experiment | Average Rouge L | Min Rouge L | Max Rouge L | Std Rouge L | Average Response Length
direct (samples=1) | 0.6167 | 0.0000 | 1.0000 | 0.3683 | 89.0400
sequence [strict] (samples=5) | 0.2997 | 0.0000 | 1.0000 | 0.3033 | 18.2115
multi_turn [strict] (samples=5) | 0.2515 | 0.0000 | 1.0000 | 0.2109 | 70.6933
structure_with_prob [strict] (samples=5) | 0.2339 | 0.0000 | 1.0000 | 0.2672 | 15.0351
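
For readers unfamiliar with the metric, ROUGE-L scores two texts by the length of their longest common subsequence (LCS) of tokens. The sketch below computes it for every pair of responses. It is an illustration, not the pipeline's code: the whitespace tokenization, the lowercasing, the balanced F1 variant, and the function names are all assumptions.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over lowercased, whitespace-split tokens (assumed setup)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def pairwise_rouge_l(responses):
    """ROUGE-L F1 for every unordered pair of responses."""
    return [rouge_l_f1(a, b) for a, b in combinations(responses, 2)]
```

Under this definition two identical responses score 1.0 and two responses with no shared tokens score 0.0, which matches the Max and Min columns above: every method produced at least one exact repeat and at least one fully disjoint pair.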

📈 Ngram Visualizations

Figure: Aggregate Metrics
Figure: Pairwise Rouge L Scores Distribution
Figure: Response Length Distribution

Joke Quality Results

Experiment | Funniness | Cleverness | Originality | Structure | Overall | Normalized Overall
direct (samples=1) | 3.5333 | 4.1467 | 3.2467 | 3.9067 | 3.6853 | 0.7371
sequence [strict] (samples=5) | 3.3077 | 3.5321 | 2.4231 | 3.9038 | 3.2763 | 0.6553
multi_turn [strict] (samples=5) | 3.2600 | 3.8733 | 2.8867 | 3.8000 | 3.4197 | 0.6839
structure_with_prob [strict] (samples=5) | 2.4474 | 2.5614 | 1.8070 | 3.0351 | 2.4360 | 0.4872
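
In all four rows, Normalized Overall equals Overall divided by 5 to within rounding, which would be consistent with a 5-point rating scale. The report does not state the scale, so this is an inference from the numbers rather than a documented rule. The short check below reproduces it; the function name normalize_overall and the MAX_SCORE constant are hypothetical.

```python
MAX_SCORE = 5.0  # assumed top of the rating scale, inferred from the table


def normalize_overall(overall, max_score=MAX_SCORE):
    """Map an overall score onto [0, 1] by dividing by the scale maximum."""
    return overall / max_score


# (Overall, Normalized Overall) pairs copied from the table above.
rows = {
    "direct (samples=1)": (3.6853, 0.7371),
    "sequence [strict] (samples=5)": (3.2763, 0.6553),
    "multi_turn [strict] (samples=5)": (3.4197, 0.6839),
    "structure_with_prob [strict] (samples=5)": (2.4360, 0.4872),
}

# Each reported value should agree with overall / 5 up to 4-decimal rounding.
mismatches = [
    name
    for name, (overall, reported) in rows.items()
    if abs(normalize_overall(overall) - reported) >= 1e-4
]
```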

📈 Joke Quality Visualizations

Figure: Aggregate Metrics
Figure: Cleverness Score Distribution
Figure: Funniness Score Distribution
Figure: Originality Score Distribution
Figure: Structure Score Distribution

📈 Overall Comparison

📈 Comparison Chart Visualizations

Figure: Comparison Chart